Mean Squared Logarithmic Error (MSLE)#
Mean Squared Logarithmic Error (MSLE) is a regression metric for non-negative targets.
It computes mean squared error after a log transform (log(1 + y)), so it mainly measures relative / multiplicative differences.
Learning goals#
Define MSLE precisely (domain, weights, multioutput)
Build intuition for what MSLE does (and does not) penalize
Implement MSLE from scratch in NumPy
Visualize MSLE behavior across scales with Plotly
Use MSLE as a training objective via log-space linear regression
Prerequisites#
NumPy basics (vectorization, shapes)
Comfort with derivatives for least squares (or gradients)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import mean_squared_error, mean_squared_log_error
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition#
Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) predictions. MSLE is defined as:

\[
\mathrm{MSLE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \big(\log(1 + y_i) - \log(1 + \hat{y}_i)\big)^2
\]

Equivalently, with the shorthand \(\log1p(t)=\log(1+t)\):

\[
\mathrm{MSLE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \big(\log1p(y_i) - \log1p(\hat{y}_i)\big)^2
\]
Domain constraints#
Because we evaluate \(\log(1+y)\), we must have:

\[
y_i \ge 0 \quad \text{and} \quad \hat{y}_i \ge 0 \quad \text{for all } i
\]

(scikit-learn enforces non-negativity).
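A quick check (the exact error message may vary across scikit-learn versions):

# Negative inputs are rejected
try:
    mean_squared_log_error([1.0, -0.5], [1.0, 1.0])
except ValueError as exc:
    print(exc)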
Weighted MSLE#
With non-negative sample weights \(w_i\):

\[
\mathrm{MSLE}_w(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i \big(\log1p(y_i) - \log1p(\hat{y}_i)\big)^2}{\sum_{i=1}^{n} w_i}
\]
Multioutput#
For \(y,\hat{y}\in\mathbb{R}^{n\times d}\) you compute MSLE per output and then aggregate (uniform average or a weighted average).
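A small example of both aggregation modes (the arrays are illustrative):

# Multioutput MSLE: per-output values vs uniform average
Y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Y_pred = np.array([[1.5, 12.0], [2.5, 18.0], [2.0, 33.0]])
per_output = mean_squared_log_error(Y_true, Y_pred, multioutput="raw_values")  # shape (2,)
uniform = mean_squared_log_error(Y_true, Y_pred)  # mean of per_output
per_output, uniform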
Units / interpretation#
MSLE is dimensionless (it operates on log values). A common variant is:

\[
\mathrm{RMSLE}(y, \hat{y}) = \sqrt{\mathrm{MSLE}(y, \hat{y})}
\]
Because \(\log1p(\hat{y})-\log1p(y)=\log\frac{1+\hat{y}}{1+y}\), RMSLE is the root-mean-square of a log-ratio error.
# A tiny example
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
msle = mean_squared_log_error(y_true, y_pred)
rmsle = float(np.sqrt(msle))
msle, rmsle
(0.03973012298459379, 0.19932416558108)
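The same value follows directly from the definition:

# MSLE computed directly from the definition
float(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))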
2) Intuition: squared error in log-space#
MSLE is just MSE applied to the transformed values \(z=\log(1+y)\). Two key consequences:
Large values are compressed. Differences among big targets contribute less than they would under MSE on the original scale.
MSLE mostly measures relative error. Using a first-order Taylor approximation:

\[
\log(1 + \hat{y}) - \log(1 + y) \approx \frac{\hat{y} - y}{1 + y}
\]

so for small errors \(\varepsilon = \hat{y}-y\):

\[
\big(\log1p(\hat{y}) - \log1p(y)\big)^2 \approx \left(\frac{\varepsilon}{1+y}\right)^2
\]

and therefore:

\[
\mathrm{MSLE} \approx \frac{1}{n}\sum_{i=1}^{n} \left(\frac{\hat{y}_i - y_i}{1 + y_i}\right)^2,
\]

which is a smoothed squared relative error (for large \(y\), \(1+y\approx y\); near zero, \(1+y\approx 1\), so MSLE behaves closer to MSE).
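A quick numerical check of this approximation (values chosen for illustration):

# MSLE vs the smoothed squared relative error approximation
y_chk = np.array([0.5, 3.0, 40.0, 500.0])
eps_chk = 0.02 * (1.0 + y_chk)  # small errors, proportional to 1 + y
msle_exact = np.mean((np.log1p(y_chk + eps_chk) - np.log1p(y_chk)) ** 2)
msle_approx = np.mean((eps_chk / (1.0 + y_chk)) ** 2)
msle_exact, msle_approx  # close for small errors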
# Visualize the log1p transform and its inverse
y_grid = np.linspace(0, 2000, 600)
z_grid = np.linspace(0, np.log1p(y_grid.max()), 600)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Transform: z = log(1+y)", "Inverse: y = exp(z) - 1"),
)
fig.add_trace(go.Scatter(x=y_grid, y=np.log1p(y_grid), mode="lines"), row=1, col=1)
fig.add_trace(go.Scatter(x=z_grid, y=np.expm1(z_grid), mode="lines"), row=1, col=2)
fig.update_xaxes(title_text="y", row=1, col=1)
fig.update_yaxes(title_text="log(1+y)", row=1, col=1)
fig.update_xaxes(title_text="z", row=1, col=2)
fig.update_yaxes(title_text="exp(z)-1", row=1, col=2)
fig.update_layout(title="log1p compresses large values (expm1 reverses it)")
fig.show()
3) Multiplicative symmetry (ratios, not differences)#
A single-sample log error can be rewritten as a ratio:

\[
\log1p(\hat{y}) - \log1p(y) = \log\frac{1+\hat{y}}{1+y} = \log r, \qquad r = \frac{1+\hat{y}}{1+y}
\]

So the per-sample squared log error is:

\[
\big(\log1p(\hat{y}) - \log1p(y)\big)^2 = (\log r)^2
\]
This means MSLE treats over- and under-prediction symmetrically in multiplicative terms: predicting a factor \(k\) too large (in \(1+y\)) has the same loss as predicting a factor \(k\) too small.
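Numerically, a factor-\(k\) overshoot and a factor-\(k\) undershoot in \(1+y\) give identical losses:

# Symmetry check: factor k too large vs factor k too small in (1 + y)
k, y0 = 3.0, 5.0
over = (np.log1p(k * (1.0 + y0) - 1.0) - np.log1p(y0)) ** 2
under = (np.log1p((1.0 + y0) / k - 1.0) - np.log1p(y0)) ** 2
over, under  # both equal (log k)^2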
# Per-sample MSLE as a function of the ratio r = (1+y_pred)/(1+y_true)
r = np.geomspace(0.05, 20.0, 600)
sq_log_error = np.log(r) ** 2
fig = px.line(
x=r,
y=sq_log_error,
log_x=True,
title="Per-sample squared log error depends on the ratio (1+y_pred)/(1+y_true)",
labels={"x": "ratio r = (1+y_pred)/(1+y_true)", "y": "(log r)^2"},
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")
# Show the symmetry: r and 1/r yield the same loss
for r0 in [2.0, 3.0, 5.0]:
fig.add_trace(
go.Scatter(
x=[r0, 1.0 / r0],
y=[np.log(r0) ** 2, np.log(1.0 / r0) ** 2],
mode="markers",
marker=dict(size=10),
name=f"r={r0} and 1/r",
)
)
fig.show()
4) MSLE vs MSE across scales#
Suppose the model makes a constant relative error, e.g. it always over-predicts by 20%.
Under MSE, the per-sample contribution \((\hat{y}-y)^2\) grows like \(y^2\). Large targets dominate the metric.
Under MSLE, the per-sample contribution is approximately a constant (for large \(y\)) because it is driven by the ratio \((1+\hat{y})/(1+y)\).
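Concretely, with a 20% over-prediction at two scales:

# Per-sample contributions for a constant 20% over-prediction
for y0 in (10.0, 1000.0):
    y_hat0 = 1.2 * y0
    print(f"y={y0}: MSE term={(y_hat0 - y0) ** 2:.1f}, "
          f"MSLE term={(np.log1p(y_hat0) - np.log1p(y0)) ** 2:.4f}")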
The plot below shows this effect.
y_true = np.geomspace(0.1, 2000.0, 200)
ratio = 1.20
y_pred = ratio * y_true
sq_err_mse = (y_pred - y_true) ** 2
sq_err_msle = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
target_log_ratio_sq = float(np.log(ratio) ** 2)
fig = make_subplots(rows=1, cols=2, subplot_titles=("MSE contribution", "MSLE contribution"))
fig.add_trace(
go.Scatter(x=y_true, y=sq_err_mse, mode="markers", name="(y_pred - y_true)^2"),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(x=y_true, y=sq_err_msle, mode="markers", name="(log1p diff)^2"),
row=1,
col=2,
)
fig.add_hline(y=target_log_ratio_sq, line_dash="dash", line_color="black", row=1, col=2)
fig.update_xaxes(title_text="y_true", type="log", row=1, col=1)
fig.update_yaxes(title_text="squared error", type="log", row=1, col=1)
fig.update_xaxes(title_text="y_true", type="log", row=1, col=2)
fig.update_yaxes(title_text="squared log error", row=1, col=2)
fig.update_layout(
title="Constant 20% relative error: MSE explodes with scale, MSLE stays (nearly) constant",
showlegend=False,
)
fig.show()
5) NumPy implementation from scratch#
Below is a small NumPy implementation that mirrors the core ideas in sklearn.metrics.mean_squared_log_error:
supports 1D targets (n,) and multioutput targets (n, d)
optional sample_weight (weighted average over samples)
multioutput: 'raw_values', 'uniform_average', or array-like output weights (d,)
It also enforces the MSLE domain constraint: no negative values.
def mean_squared_log_error_np(
y_true,
y_pred,
*,
sample_weight=None,
multioutput="uniform_average",
):
"""Compute MSLE with optional sample weights and multioutput support.
Parameters
----------
y_true, y_pred:
Array-like with shape (n,) or (n, d). Must be non-negative.
sample_weight:
None or array-like with shape (n,)
multioutput:
- 'raw_values' -> return array of shape (d,)
- 'uniform_average' -> return scalar (mean over outputs)
- array-like (d,) -> weighted average over outputs
Returns
-------
float or np.ndarray
"""
y_true_arr = np.asarray(y_true, dtype=float)
y_pred_arr = np.asarray(y_pred, dtype=float)
if y_true_arr.shape != y_pred_arr.shape:
raise ValueError(f"Shape mismatch: y_true {y_true_arr.shape} vs y_pred {y_pred_arr.shape}")
if np.any(y_true_arr < 0) or np.any(y_pred_arr < 0):
raise ValueError("MSLE is only defined for non-negative y_true and y_pred")
if y_true_arr.ndim == 1:
y_true_arr = y_true_arr[:, None]
y_pred_arr = y_pred_arr[:, None]
elif y_true_arr.ndim != 2:
raise ValueError("y_true/y_pred must be 1D or 2D")
n_samples, n_outputs = y_true_arr.shape
log_true = np.log1p(y_true_arr)
log_pred = np.log1p(y_pred_arr)
sq_log_error = (log_true - log_pred) ** 2 # (n, d)
if sample_weight is not None:
w = np.asarray(sample_weight, dtype=float)
if w.ndim != 1 or w.shape[0] != n_samples:
raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
denom = float(np.sum(w))
if denom == 0.0:
raise ValueError("sum(sample_weight) must be > 0")
msle_per_output = np.sum(sq_log_error * w[:, None], axis=0) / denom
else:
msle_per_output = np.mean(sq_log_error, axis=0)
if isinstance(multioutput, str):
if multioutput == "raw_values":
return msle_per_output
if multioutput == "uniform_average":
return float(np.mean(msle_per_output))
raise ValueError(
"multioutput must be 'raw_values', 'uniform_average', or array-like of output weights"
)
output_weights = np.asarray(multioutput, dtype=float)
if output_weights.shape != (n_outputs,):
raise ValueError(f"multioutput weights must have shape ({n_outputs},), got {output_weights.shape}")
weight_sum = float(np.sum(output_weights))
if weight_sum == 0.0:
raise ValueError("sum(multioutput weights) must be > 0")
return float(np.dot(msle_per_output, output_weights) / weight_sum)
# Quick check vs scikit-learn
y_true_2d = rng.uniform(0.0, 8.0, size=(30, 3))
y_pred_2d = np.clip(y_true_2d + rng.normal(scale=0.8, size=(30, 3)), 0.0, None)
w = rng.uniform(0.5, 2.0, size=30)
ours = mean_squared_log_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
sk = mean_squared_log_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
ours, sk, np.allclose(ours, sk)
(array([0.0611, 0.0547, 0.058 ]), array([0.0611, 0.0547, 0.058 ]), True)
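Output weights work the same way; the weights below are purely illustrative:

# Weighted average over the 3 outputs (weights chosen for illustration)
out_w = np.array([2.0, 1.0, 1.0])
ours_w = mean_squared_log_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)
sk_w = mean_squared_log_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)
ours_w, sk_w, np.allclose(ours_w, sk_w)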
6) Gradient w.r.t. predictions (useful for optimization)#
For one sample, define the log residual:

\[
e_i = \log1p(\hat{y}_i) - \log1p(y_i)
\]

MSLE is \(J = \frac{1}{n}\sum_i e_i^2\). Using the chain rule:

\[
\frac{\partial J}{\partial \hat{y}_i} = \frac{2}{n} \cdot \frac{e_i}{1 + \hat{y}_i}
\]
Notice the factor \(\frac{1}{1+\hat{y}_i}\): gradients are naturally scaled like relative errors.
def msle_grad_y_pred(y_true, y_pred):
"""Gradient of MSLE w.r.t y_pred (1D arrays, uniform averaging)."""
y_true = np.asarray(y_true, dtype=float)
y_pred = np.asarray(y_pred, dtype=float)
if y_true.shape != y_pred.shape:
raise ValueError("shape mismatch")
if np.any(y_true < 0) or np.any(y_pred < 0):
raise ValueError("MSLE is only defined for non-negative values")
e = np.log1p(y_pred) - np.log1p(y_true)
return (2.0 / y_true.size) * e / (1.0 + y_pred)
# Finite-difference check
y_true = rng.uniform(0.0, 5.0, size=7)
y_pred = rng.uniform(0.0, 5.0, size=7)
g_ana = msle_grad_y_pred(y_true, y_pred)
eps = 1e-6
g_num = np.zeros_like(y_pred)
for i in range(y_pred.size):
y_pred_p = y_pred.copy()
y_pred_m = y_pred.copy()
y_pred_p[i] += eps
y_pred_m[i] -= eps
y_pred_m[i] = max(0.0, y_pred_m[i])
f_p = mean_squared_log_error_np(y_true, y_pred_p)
f_m = mean_squared_log_error_np(y_true, y_pred_m)
g_num[i] = (f_p - f_m) / (2 * eps)
g_ana, g_num, np.allclose(g_ana, g_num, rtol=2e-4, atol=2e-4)
(array([ 0.0186, 0.0384, -0.0793, -0.0236, -0.0056, 0.0618, 0.0478]),
array([ 0.0186, 0.0384, -0.0793, -0.0236, -0.0056, 0.0618, 0.0478]),
True)
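As a toy illustration (the step size and iteration count here are arbitrary, untuned choices), these gradients can drive y_pred toward y_true via projected gradient descent, clipping to keep predictions non-negative:

# Toy projected gradient descent directly on y_pred
y_hat = y_pred.copy()
for _ in range(500):
    y_hat = np.maximum(0.0, y_hat - 1.0 * msle_grad_y_pred(y_true, y_hat))  # step, then project
mean_squared_log_error_np(y_true, y_pred), mean_squared_log_error_np(y_true, y_hat)  # loss should drop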
7) Using MSLE to optimize a model: log-space linear regression#
A very common way to train a model for MSLE is to predict in log-space. Define:

\[
z_i = \log1p(y_i) = \log(1 + y_i)
\]

and fit a linear model in \(z\):

\[
\hat{z}_i = b_0 + b_1 x_i
\]

Then convert back to the original scale with:

\[
\hat{y}_i = \exp(\hat{z}_i) - 1
\]

This keeps \(\hat{y}_i > -1\) by construction (and \(\hat{y}_i \ge 0\) whenever \(\hat{z}_i \ge 0\)), and the objective becomes:

\[
J(b_0, b_1) = \frac{1}{n}\sum_{i=1}^{n} \big(\hat{z}_i - z_i\big)^2,
\]
which is just least squares on \(z\) (a convex quadratic).
Gradients#
Let \(r_i = \hat{z}_i - z_i\). Then:

\[
\frac{\partial J}{\partial b_0} = \frac{2}{n}\sum_{i=1}^{n} r_i,
\qquad
\frac{\partial J}{\partial b_1} = \frac{2}{n}\sum_{i=1}^{n} x_i r_i
\]
We can minimize this with gradient descent (or solve it in closed form).
# Synthetic data with multiplicative / log-normal noise
n = 260
x = rng.uniform(0.0, 6.0, size=n)
b0_true = 0.8
b1_true = 0.9
sigma = 0.35
z_true = b0_true + b1_true * x
z_obs = z_true + rng.normal(0.0, sigma, size=n)
y = np.expm1(z_obs) # y >= 0
# Visualize in original space vs log space
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Data in original y scale", "Same data in z = log(1+y) scale"),
)
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="y"), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=np.log1p(y), mode="markers", name="log1p(y)"), row=1, col=2)
fig.update_xaxes(title_text="x", row=1, col=1)
fig.update_yaxes(title_text="y", row=1, col=1)
fig.update_xaxes(title_text="x", row=1, col=2)
fig.update_yaxes(title_text="log(1+y)", row=1, col=2)
fig.update_layout(title="Log space often reveals a simpler (linear) relationship")
fig.show()
def fit_line_msle_gd(x, y, *, lr=0.05, n_steps=400):
"""Fit y_hat = exp(b0 + b1*x) - 1 by minimizing MSLE via GD in log-space."""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
if np.any(y < 0):
raise ValueError("y must be non-negative for MSLE")
z = np.log1p(y)
n = x.size
b0 = float(z.mean())
b1 = 0.0
history = {"step": [], "msle": [], "b0": [], "b1": []}
for t in range(n_steps):
z_hat = b0 + b1 * x
r = z_hat - z
msle_t = float(np.mean(r**2))
grad_b0 = (2.0 / n) * np.sum(r)
grad_b1 = (2.0 / n) * np.sum(x * r)
b0 -= lr * grad_b0
b1 -= lr * grad_b1
history["step"].append(t)
history["msle"].append(msle_t)
history["b0"].append(b0)
history["b1"].append(b1)
return (b0, b1), history
(b0_gd, b1_gd), hist = fit_line_msle_gd(x, y, lr=0.06, n_steps=350)
# Closed-form least squares in z-space
X = np.column_stack([np.ones_like(x), x])
z = np.log1p(y)
b0_ols, b1_ols = np.linalg.lstsq(X, z, rcond=None)[0]
(b0_gd, b1_gd), (b0_ols, b1_ols)
((0.7382724858458961, 0.9098627244134785),
(0.7381944920965867, 0.9098824642871552))
# Training curve
fig = px.line(
x=hist["step"],
y=hist["msle"],
title="Gradient descent minimizing MSLE (in log space)",
labels={"x": "step", "y": "MSLE"},
)
fig.show()
# Compare MSLE-trained model to an MSE-on-y linear baseline
b0_mse, b1_mse = np.linalg.lstsq(X, y, rcond=None)[0]
x_line = np.linspace(x.min(), x.max(), 300)
X_line = np.column_stack([np.ones_like(x_line), x_line])
y_pred_msle = np.expm1(b0_ols + b1_ols * x)
y_line_msle = np.expm1(X_line @ np.array([b0_ols, b1_ols]))
y_pred_mse = b0_mse + b1_mse * x
y_line_mse = X_line @ np.array([b0_mse, b1_mse])
y_pred_mse_clip = np.clip(y_pred_mse, 0.0, None)
metrics = {
"MSE (baseline linear)": mean_squared_error(y, y_pred_mse),
"MSE (MSLE-fit model)": mean_squared_error(y, y_pred_msle),
"MSLE (baseline linear, clipped)": mean_squared_log_error(y, y_pred_mse_clip),
"MSLE (MSLE-fit model)": mean_squared_log_error(y, y_pred_msle),
}
metrics
{'MSE (baseline linear)': 8020.024141455282,
'MSE (MSLE-fit model)': 3214.396742264166,
'MSLE (baseline linear, clipped)': 1.0271819896986292,
'MSLE (MSLE-fit model)': 0.128488167880701}
# Visual comparison
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", opacity=0.65))
fig.add_trace(go.Scatter(x=x_line, y=y_line_msle, mode="lines", name="MSLE-fit: exp(linear)-1"))
fig.add_trace(go.Scatter(x=x_line, y=y_line_mse, mode="lines", name="MSE baseline: linear"))
fig.update_layout(
title="MSLE-fit model tracks multiplicative structure; MSE-linear baseline can underfit small values",
xaxis_title="x",
yaxis_title="y",
)
fig.show()
# Loss landscape in (b0, b1) for the log-space objective (convex bowl)
b0_grid = np.linspace(b0_ols - 0.8, b0_ols + 0.8, 140)
b1_grid = np.linspace(b1_ols - 0.8, b1_ols + 0.8, 140)
B0, B1 = np.meshgrid(b0_grid, b1_grid)
residuals = z[None, None, :] - (B0[:, :, None] + B1[:, :, None] * x[None, None, :])
Z = np.mean(residuals**2, axis=2)
stride = max(1, len(hist["step"]) // 120)
b0_path = np.array(hist["b0"])[::stride]
b1_path = np.array(hist["b1"])[::stride]
fig = go.Figure()
fig.add_trace(
go.Contour(
x=b0_grid,
y=b1_grid,
z=Z,
contours_coloring="heatmap",
colorbar=dict(title="MSLE"),
)
)
fig.add_trace(
go.Scatter(
x=b0_path,
y=b1_path,
mode="lines+markers",
name="GD path",
marker=dict(size=4, color="black"),
line=dict(color="black"),
)
)
fig.add_trace(
go.Scatter(
x=[b0_ols],
y=[b1_ols],
mode="markers",
name="OLS optimum",
marker=dict(symbol="x", size=10, color="white", line=dict(color="black", width=2)),
)
)
fig.update_layout(
title="MSLE landscape for log-space line: convex bowl + gradient descent trajectory",
xaxis_title="b0",
yaxis_title="b1",
)
fig.show()
8) Practical usage (scikit-learn)#
Scoring#
MSLE is available as:
from sklearn.metrics import mean_squared_log_error
In model selection, scikit-learn uses the (negated) scorer string:
neg_mean_squared_log_error
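For example, in cross-validation (a minimal sketch; the estimator must produce non-negative predictions on every fold, which a random forest does when y is non-negative):

# Sketch: negated MSLE as a cross-validation score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    RandomForestRegressor(random_state=0),
    x[:, None],
    y,
    cv=5,
    scoring="neg_mean_squared_log_error",
)
float(-scores.mean())  # convert back to MSLE (lower is better)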
Training for MSLE#
A common pattern is to train a regressor on log1p(y) and invert with expm1.
In scikit-learn, TransformedTargetRegressor provides this neatly.
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
X_1d = x[:, None]
reg = TransformedTargetRegressor(
regressor=LinearRegression(),
func=np.log1p,
inverse_func=np.expm1,
)
reg.fit(X_1d, y)
y_pred = reg.predict(X_1d)
mean_squared_log_error(y, y_pred)
0.12848816788070097
Pros, cons, and when to use MSLE#
Pros#
Relative-error focus: penalizes multiplicative differences (good for targets spanning orders of magnitude)
Less dominated by large targets than MSE on the original scale (log compresses large values)
Handles zeros (thanks to log(1+y))
Symmetric for over/under prediction in ratio terms: \(r\) and \(1/r\) have the same per-sample loss
Cons#
Requires non-negative values: cannot be used if \(y\) (or predictions) can be negative
The +1 matters: near zero, MSLE behaves more like MSE and depends on the unit/scale implied by the +1
Can hide large absolute errors on large targets (by design)
Less directly interpretable than MAE/RMSE in the original units
Good use cases#
Demand/count forecasting, sales, traffic, inventory: targets are non-negative and often long-tailed
Problems where a 20% miss on a small value should matter similarly to a 20% miss on a large value
Situations where you expect multiplicative noise (log-normal style)
Common pitfalls / diagnostics#
If your model can output negatives (e.g., plain linear regression), MSLE will error; consider:
predicting in log-space and inverting with expm1, or
using a non-negative link function (e.g., softplus) or clipping (clipping changes the objective); a minimal softplus sketch follows below.
Plot residuals both in the original scale and in log-space; MSLE can look good while large absolute errors remain.
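A minimal softplus sketch (raw stands in for hypothetical unconstrained model outputs):

# Softplus link: map unconstrained scores to non-negative predictions
raw = np.array([-2.0, 0.5, 3.0])  # hypothetical unconstrained outputs
y_pred_nonneg = np.logaddexp(0.0, raw)  # softplus(t) = log(1 + exp(t)) >= 0
y_pred_nonneg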
Exercises#
Show that MSLE is exactly MSE on \(z=\log1p(y)\).
Modify mean_squared_log_error_np to return RMSLE (square root) as an option.
Create a dataset with additive noise instead of multiplicative noise and compare models trained with MSE vs MSLE.
References#
scikit-learn: sklearn.metrics.mean_squared_log_error
Log transform modeling pattern: sklearn.compose.TransformedTargetRegressor